.pn1
.po4
A VERY SIMPLE CHART PARSER

Peter Ross
Dept. of AI, University of Edinburgh,
80 South Bridge,
Edinburgh EH1 1HN,
Scotland.

This describes how to use the (very) simple chart parser written in Prolog
in the file "chart.pro". You will need LOTS of stack space - the program
does no assertion or retraction at run time, but keeps all the run-time
data in lists.

The program expects data in the form of Prolog clauses:

[1] rule(Tag, Category, ExpansionList).
        The 'Tag' argument is just an arbitrary Prolog term used to group
        together a set of rule/3 and lexical/3 clauses. The parser needs
        to be told this tag in order to know which set of these clauses
        to use as data. Since it is arbitrary, you could use a compound
        term to allow you to selectively include rules, at no extra cost.
        The 'Category' argument identifies a syntactic category, e.g.
        noun or sentence.
        The 'ExpansionList' is a list giving one possible expansion of that
        category into sub-categories.
        Example:
                rule(simple, sentence, [noun_phrase, verb_phrase]).
        The categories (both Category and those in ExpansionList) can be
        compound terms, thus allowing the use of variables for result passing
        and context setting. The parser copies terms as needed, when
        unification might be a bad thing to do because it might cause
        supposedly independent data structures to become too specific.
        This copying, by the way, is done by dismantling the source structure
        and building a duplicate rather than by the crude assert/retract
        method; the latter tends to be much slower but takes less run-time
        space.

        NOTE: categories are not supposed to be defined recursively.
        Nevertheless, the code does check whether you are adding an edge you
        already have. The check does slow things a bit, but is specifically
        a check against left recursion at the time when empty active edges
        are added. A fuller check against adding duplicate inactive edges
        has been commented out; you probably don't need it unless you mess
        with the rule tags at run time so as to cause new parsing rules to
        be introduced during parsing.

[2] lexical(Tag, Word, CategoryList).
        The 'Tag' argument is as above.
        The 'Word' is a word that may legitimately appear in the input.
        The 'CategoryList' is a list of the possible syntactic categories
        which that word might fit, e.g. 'fly' might be a noun or verb.
        As above, the categories can be arbitrary terms.

[3] initial_category(Tag, Category).
        This states what the highest level category is - it should be unique.Š
[4] strategy(Tag, Strategy).
        This should instantiate Strategy to either 'td' for top-down or
        'bu' for bottom-up. The parser does not check it for validity.

[5] policy(Tag, Policy).
        This should instantiate Policy to either 'df' for depth-first or
        'bf' for breadth-first. The parser does not check it for validity.

[6] ersatz_category(Tag, DummyCategoryName).
        By default, the category name 'user' is specially used by the system.
        It supposes that there is an extra rule of the form
                rule(Tag, user, [TopLevelCategory])
        when doing top-down parsing. This extra category will of course appear
        in the final chart, and should help in the later analysis of it. If
        you are really attached to the category name 'user' but want to do
        some top-down parsing, then you can persuade the system to use a
        different name instead by including a clause of this form.
        Although this ersatz category can be useful, its original purpose
        was to make the program shorter at negligible speed cost by ensuring
        that the top-level category appeared on the left of a single rule.

IMPORTANT: The rule/3 and lexical/3 predicates are only used during the
           initialisation stage, when the transformed rules are constructed
           and when the initial chart is created.

There are three important data structures manipulated by the program:

[a] edge(Category, FoundList, NeedsList, StartVertex, EndVertex).
        This describes an edge in the chart. If you don't know what this
        means, read Winograd's book "Language As A Cognitive Process, vol 1"
        or the articles by Henry Thompson and Graeme Ritchie in "Artificial
        Intelligence: Tools, Techniques and Applications" by O'Shea and
        Eisenstadt (published by Harper and Row).
        There are three odd details. Items on the 'FoundList' are of the form
                Category = VertexNumber
        (the starting, left-hand vertex of that instance of the category).
        A word from the input will appear in the 'FoundList' as
                word(Word) = VertexNumber
        The 'FoundList' is ordered so that the most recently found category is
        first.

[b] the chart: ActiveEdgeList + InactiveEdgeList.
        The edges are segregated into two lists for convenience.
        The parser returns such a chart when it finishes.

[c] the agenda: CandidateList - Hole.
        This is a difference list (a list with a variable at the end - 'Hole'
        is the variable). The items in the list are of the form
                ActiveEdge + InactiveEdge
        and are candidates (guaranteed successful, in fact) for an application
        of the fundamental rule of chart parsing.
        The default monitoring system prints out agendas.

The program has two top level predicates. They are:Š
[i]   parse(Tag, WordList, MaxVertex, Chart).
        The Tag and WordList must be given. The MaxVertex and Chart must be
        variables. When the parse is done, MaxVertex will be instantiated to
        the largest vertex number (the lowest is 0), and Chart will be
        instantiated to the final chart. When doing top-down parsing, the
        parser adds one ersatz rule of the form
                rule(Tag,user,[TopMostCategory])
        - there will be edges for the ersatz category 'user' to help you to
        examine the chart afterwards for successful parsings. You can
        substitute what you want for 'user' - see above. This predicate does
        some transformations on the rule/3 clauses supplied by the user. The
        transformations are all done by the predicate invert_rules(Tag).

[ii]  make_chart(Tag, WordList, MaxVertex, Chart).
        This is exactly like parse/4 above, but assumes that the rule
        transformations have already been done.

When a parse has been done and a final chart has been produced, you may
want to add to the word list and extend the chart, for example if using the
parser for plan recognition. You can do so by the following means: call
        initial_setup(Tag, Strategy, Policy, NewWordList,
                      MinVertex, NewMaxVertex,
                      OldChart, NewInitialChart,
                      AgendaOfYourChoice, NewAgenda),
        chart(Tag, Strategy, Policy, NewInitialChart, NewAgenda, FinalChart)
This will assume that more words are added between MinVertex and NewMaxVertex,
and will add to the OldChart and AgendaOfYourChoice you gave it, to create
a new FinalChart. Since all the information needed about the presence of
any previous set of words (presumably ending at MinVertex) will be contained
in the OldChart you got out of parsing that previous word sequence, this will
incrementally add to the chart.

Normally parsing stops when the parser runs out of agenda. You may discover
a need to suspend parsing under program control, for example if you want
to doctor the chart or extract useful information from it before the entire
input sequence is available. If so, you may redefine the test
        stop_parser(Tag,Strategy,Policy,Chart,Agenda,FinalChart)
which currently does no more than instantiate the result FinalChart to
Chart when Agenda is empty. It is the success of this predicate which halts
parsing.

The implementation of the fundamental rule is explicit within the code, so
that it is easy to change for your own purposes.


There is a simple monitoring mechanism. The predicates
        watch(Tag)      nowatch(Tag)
turn it on or off for the specified rule set. If on, it reports when the
rule transformations are complete, and whenever an item on the agenda is
processed, it tries to call the predicate
        user_mon(Tag,Strategy,Policy,OldChart,OldAgenda,NewChart,NewAgenda)
If this fails, it uses a default strategy of printing out the new chart and
agenda and waiting for a <cr> before continuing.
ŠThere is a simple test rig, called by the predicate
        test(Tag).
It makes use of a trivial utility
        print_chart(Chart)
to print out the active and inactive edges after the parse has finished.
The lists of edges are sorted before being printed, into the standard order
of Prolog terms.

A sample rule set can be found in the files "test01.pro" and "test02.pro".


IMPLEMENTATION: SOME THOUGHTS

This program clearly shows the need for some kind of user-definable indexing
in Prolog in order to make such things efficient. Unfortunately record/3 is
not much use, because the key can only usefully be atomic (only the principal
functor name matters) and in C-Prolog the key cannot be an integer (nor is
record/3 that fast in C-Prolog). At present all you've really got is the
clause indexing mechanism, so you're stuck with assert/retract and all that
that implies if you want to cut down on searching through big data structures
such as long lists. Gimme arrays, gimme hunks, gimme a decent record package,
gimme hash tables, gimme something!! Prolog-2 does let you specify hash tables
for particular predicates, but these are only used by the calling mechanism.
Wouldn't it be nice to have some way of specifying that anything with a given
principal functor was to be accessed through a set of hash tables,
sub-indexing specifiable by the user in some way, e.g. akin to that offered
by PEARL for frame access? Some other Prologs (e.g. Arity Prolog) do now
offer such things as B-trees and hash tables for term storage.

At the moment, for simplicity, the chart is only represented as two lists -
one of active edges, one of inactive edges. Since the chart only grows, never
shrinks, it might be better to represent it as two ordered trees peppered with
holes (variables). This ought to speed things up a lot.

I considered the idea of not allowing Prolog variables in the terms which
represent categories; this would end all temptation to misuse unification.
Instead, I have in the past used terms of the form ?Atom as variables (where
?/1 is defined as a prefix operator, none of this messy atom exploding done
by LISPers who don't/can't alter their character tables). This leads to
passing binding lists about a lot. As far as the ersatz unification in edge
generation goes, there is no significant speed difference between using
Prolog variables and these pseudo-variables. However, there is a difference
in speed when looking for candidate edge pairs, since with Prolog variables
I can just use the dodge of specifying \+(\+(Category1=Category2)), which
checks whether they will unify, without doing so, in a reasonably efficient
way. With pseudo-variables I'd need to indulge in term smashing or suffer
the overhead of carrying binding lists about in the chart. The latter is not
so awful as it might look, since it would be possible to construct the unifier
at the same time as checking candidacy, and just carry the unifier about till
needed. But, this still looks a bit worse than using Prolog variables.


